SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes
Online reconstruction and rendering of large-scale indoor scenes is a
long-standing challenge. SLAM-based methods can reconstruct 3D scene geometry
progressively in real time but cannot render photorealistic results. While
NeRF-based methods produce promising novel view synthesis results, their long
offline optimization time and lack of geometric constraints pose challenges to
efficiently handling online input. Inspired by the complementary advantages of
classical 3D reconstruction and NeRF, we investigate marrying explicit
geometric representation with NeRF rendering to achieve efficient online
reconstruction and high-quality rendering. We introduce SurfelNeRF, a variant
of neural radiance field which employs a flexible and scalable neural surfel
representation to store geometric attributes and extracted appearance features
from input images. We further extend the conventional surfel-based fusion
scheme to progressively integrate incoming input frames into the reconstructed
global neural scene representation. In addition, we propose a highly efficient
differentiable rasterization scheme for rendering neural surfel radiance
fields, which helps SurfelNeRF achieve speedups in both training and inference.
Experimental results show that our method achieves state-of-the-art PSNR of
23.82 and 29.58 on ScanNet in the feedforward inference and per-scene
optimization settings, respectively.
Comment: To appear in CVPR 2023
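As a rough illustration of the surfel-based fusion scheme described in the abstract, the sketch below keeps a global array of neural surfels and merges each incoming frame's surfels into it. This is not the authors' implementation: the merge radius, the normal-agreement test, and the confidence-weighted averaging are simplifying assumptions, and the appearance features would in practice come from an image encoder.

```python
# Minimal sketch (assumptions throughout): a global neural surfel store with a
# simplified per-frame fusion step.
import numpy as np

class NeuralSurfels:
    """Global scene as a flat array of surfels carrying learned appearance features."""
    def __init__(self, feat_dim=32):
        self.positions = np.empty((0, 3), dtype=np.float32)         # surfel centers
        self.normals   = np.empty((0, 3), dtype=np.float32)         # surfel normals
        self.features  = np.empty((0, feat_dim), dtype=np.float32)  # appearance features
        self.weights   = np.empty((0,), dtype=np.float32)           # fusion confidence

    def fuse(self, positions, normals, features, merge_radius=0.02):
        """Fuse one frame's surfels: average into nearby global surfels, append the rest."""
        for p, n, f in zip(positions, normals, features):
            if len(self.positions):
                d = np.linalg.norm(self.positions - p, axis=1)
                j = int(np.argmin(d))
                # merge when close in space and roughly aligned in normal direction
                if d[j] < merge_radius and float(np.dot(self.normals[j], n)) > 0.5:
                    w = self.weights[j]
                    self.positions[j] = (w * self.positions[j] + p) / (w + 1.0)
                    self.features[j]  = (w * self.features[j] + f) / (w + 1.0)
                    self.weights[j]  += 1.0
                    continue
            # otherwise the incoming surfel starts a new global entry
            self.positions = np.vstack([self.positions, p[None]])
            self.normals   = np.vstack([self.normals, n[None]])
            self.features  = np.vstack([self.features, f[None]])
            self.weights   = np.append(self.weights, 1.0)
```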
ID-Pose: Sparse-view Camera Pose Estimation by Inverting Diffusion Models
Given sparse views of an object, estimating their camera poses is a
long-standing and intractable problem. We harness a diffusion model pre-trained
to generate novel views conditioned on viewpoints (Zero-1-to-3). We present
ID-Pose, which inverts the denoising diffusion process to estimate the relative
pose given two input images. ID-Pose adds noise to one image and predicts
the noise conditioned on the other image and a decision variable for the pose.
The prediction error is used as the objective to find the optimal pose via
gradient descent. ID-Pose can handle more than two images, estimating each
pose from multiple image pairs using their triangular relationships.
ID-Pose requires no training and generalizes to real-world images. We conduct
experiments using high-quality real-scanned 3D objects, where ID-Pose
significantly outperforms state-of-the-art methods.
Comment: 7 pages. GitHub: https://xt4d.github.io/id-pose
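The inversion loop behind ID-Pose can be sketched as follows. The `denoiser` callable stands in for a viewpoint-conditioned model such as Zero-1-to-3; its interface, the three-parameter pose, the noise schedule, and the timestep range are assumptions of this sketch rather than the released code.

```python
# Minimal sketch: optimize a relative pose by inverting a viewpoint-conditioned denoiser.
import torch

def make_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def estimate_relative_pose(denoiser, img_a, img_b, steps=200, lr=1e-2):
    """denoiser(noisy, t, cond_img, pose) -> predicted noise (assumed interface)."""
    alphas_cumprod = make_alphas_cumprod()
    pose = torch.zeros(3, requires_grad=True)   # e.g. (elevation, azimuth, radius); an assumption
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        t = torch.randint(100, 900, (1,))                        # random diffusion timestep
        noise = torch.randn_like(img_a)
        a = alphas_cumprod[t].view(-1, *([1] * (img_a.dim() - 1)))
        noisy_a = a.sqrt() * img_a + (1.0 - a).sqrt() * noise    # forward diffusion of image A
        pred = denoiser(noisy_a, t, img_b, pose)                 # prediction conditioned on B and pose
        loss = torch.nn.functional.mse_loss(pred, noise)         # prediction error is the objective
        opt.zero_grad(); loss.backward(); opt.step()             # gradient descent on the pose
    return pose.detach()
```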
OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution
Omnidirectional images (ODIs) have become increasingly popular, as their
large field-of-view (FoV) can offer viewers the chance to freely choose the
view directions in immersive environments such as virtual reality. The Möbius
transformation is typically employed to further enable movement and zoom on
ODIs, but applying it at the image level often results in blur and aliasing.
In this paper, we propose a novel deep learning-based approach, called
OmniZoomer, to incorporate the Möbius transformation into the network for
movement and zoom on ODIs. By learning various transformed feature maps under
different conditions, the network is enhanced to handle the increased edge
curvatures, which alleviates the blur. Moreover, to address the aliasing
problem, we propose two
key components. Firstly, to compensate for the lack of pixels for describing
curves, we enhance the feature maps in the high-resolution (HR) space and
calculate the transformed index map with a spatial index generation module.
Secondly, considering that ODIs are inherently represented in the spherical
space, we propose a spherical resampling module that combines the index map and
HR feature maps to transform the feature maps for better spherical correlation.
The transformed feature maps are decoded to output a zoomed ODI. Experiments
show that our method can produce HR and high-quality ODIs with the flexibility
to move and zoom in on the object of interest. The project page is available at
http://vlislab22.github.io/OmniZoomer/.
Comment: Accepted by ICCV 2023
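For reference, the Möbius transformation on the sphere that OmniZoomer learns to apply in feature space can be written compactly: project the sphere to the complex plane stereographically, apply z -> (az + b)/(cz + d), and project back. The sketch below shows only this geometric transform, not the paper's feature-space modules, and the coordinate conventions are assumptions.

```python
# Minimal sketch: Möbius transformation of spherical (longitude, latitude) coordinates.
import numpy as np

def mobius_on_sphere(lon, lat, a=1 + 0j, b=0j, c=0j, d=1 + 0j):
    """lon in [-pi, pi), lat in [-pi/2, pi/2]; returns transformed (lon, lat)."""
    # unit-sphere point -> complex plane via stereographic projection from the north pole
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    w = (x + 1j * y) / (1.0 - z + 1e-9)
    # Möbius transformation in the plane
    w = (a * w + b) / (c * w + d)
    # back to the sphere
    denom = 1.0 + np.abs(w) ** 2
    x2, y2 = 2.0 * w.real / denom, 2.0 * w.imag / denom
    z2 = (np.abs(w) ** 2 - 1.0) / denom
    return np.arctan2(y2, x2), np.arcsin(np.clip(z2, -1.0, 1.0))
```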
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
Recent CLIP-guided 3D optimization methods, such as DreamFields and
PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D
synthesis. However, due to training from scratch and random initialization without
prior knowledge, these methods often fail to generate accurate and faithful 3D
structures that conform to the input text. In this paper, we make the first
attempt to introduce explicit 3D shape priors into the CLIP-guided 3D
optimization process. Specifically, we first generate a high-quality 3D shape
from the input text in the text-to-shape stage as a 3D shape prior. We then use
it as the initialization of a neural radiance field and optimize it with the
full prompt. To address the challenging text-to-shape generation task, we
present a simple yet effective approach that directly bridges the text and
image modalities with a powerful text-to-image diffusion model. To narrow the
style domain gap between the images synthesized by the text-to-image diffusion
model and shape renderings used to train the image-to-shape generator, we
further propose to jointly optimize a learnable text prompt and fine-tune the
text-to-image diffusion model for rendering-style image generation. Our method,
Dream3D, is capable of generating imaginative 3D content with superior visual
quality and shape accuracy compared to state-of-the-art methods.
Comment: Accepted by CVPR 2023. Project page:
https://bluestyle97.github.io/dream3d
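The two-stage structure of the pipeline reads naturally as pseudocode. Every callable below (`text_to_shape`, `init_nerf_from_shape`, `sample_camera`, `render`, `clip_similarity`) is a placeholder assumed for illustration, not an interface from the paper's code; the sketch only captures the structure described in the abstract: a text-to-shape prior initializes the radiance field, which is then optimized with the full prompt under CLIP guidance.

```python
# High-level sketch of a two-stage, shape-prior-initialized text-to-3D pipeline.
import torch

def dream3d_style_pipeline(prompt, text_to_shape, init_nerf_from_shape,
                           sample_camera, render, clip_similarity,
                           steps=1000, lr=5e-3):
    shape_prior = text_to_shape(prompt)            # stage 1: text -> explicit 3D shape prior
    nerf = init_nerf_from_shape(shape_prior)       # initialize a radiance field from the prior
    opt = torch.optim.Adam(nerf.parameters(), lr=lr)
    for _ in range(steps):                         # stage 2: optimize with the full prompt
        image = render(nerf, sample_camera())      # render from a random viewpoint
        loss = -clip_similarity(image, prompt)     # maximize CLIP image-text agreement
        opt.zero_grad(); loss.backward(); opt.step()
    return nerf
```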
PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas
Achieving an immersive experience enabling users to explore virtual
environments with six degrees of freedom (6DoF) is essential for various
applications such as virtual reality (VR). Wide-baseline panoramas are commonly
used in these applications to reduce network bandwidth and storage
requirements. However, synthesizing novel views from these panoramas remains a
key challenge. Although existing neural radiance field methods can produce
photorealistic views under narrow-baseline and dense image captures, they tend
to overfit the training views when dealing with wide-baseline panoramas
due to the difficulty in learning accurate geometry from sparse
views. To address this problem, we propose PanoGRF, Generalizable Spherical
Radiance Fields for Wide-baseline Panoramas, which constructs spherical radiance
fields incorporating scene priors. Unlike generalizable radiance
fields trained on perspective images, PanoGRF avoids the information loss from
panorama-to-perspective conversion and directly aggregates geometry and
appearance features of 3D sample points from each panoramic view based on
spherical projection. Moreover, as some regions of the panorama are visible
in only one view and invisible in the others under wide-baseline settings,
PanoGRF incorporates monocular depth priors into spherical depth
estimation to improve the geometry features. Experimental results on multiple
panoramic datasets demonstrate that PanoGRF significantly outperforms
state-of-the-art generalizable view synthesis methods for wide-baseline
panoramas (e.g., OmniSyn) and perspective images (e.g., IBRNet, NeuRay).
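The spherical projection used to fetch per-view features for a 3D sample point can be sketched directly. The axis and world-to-camera conventions below are assumptions; the point is simply that a 3D sample maps to equirectangular pixel coordinates (and a spherical depth) in each panorama, avoiding any panorama-to-perspective conversion.

```python
# Minimal sketch: project a 3D world point into equirectangular panorama coordinates.
import numpy as np

def project_to_equirect(point_w, cam_R, cam_t, width, height):
    """point_w: (3,) world point; cam_R, cam_t: world-to-camera rotation and translation."""
    p = cam_R @ point_w + cam_t                  # point in the panorama's camera frame
    r = np.linalg.norm(p) + 1e-9
    lon = np.arctan2(p[0], p[2])                 # longitude in [-pi, pi)
    lat = np.arcsin(np.clip(p[1] / r, -1.0, 1.0))  # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * np.pi) + 0.5) * width      # equirectangular pixel coordinates
    v = (lat / np.pi + 0.5) * height
    return u, v, r                               # r doubles as the point's spherical depth
```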
MonoNeuralFusion: Online Monocular Neural 3D Reconstruction with Geometric Priors
High-fidelity 3D scene reconstruction from monocular videos continues to be
challenging, especially for complete and fine-grained geometry reconstruction.
Previous 3D reconstruction approaches with neural implicit representations
have shown promise for complete scene reconstruction, but their results are
often over-smoothed and lack sufficient geometric detail. This paper
introduces a novel neural implicit scene representation with volume rendering
for high-fidelity online 3D scene reconstruction from monocular videos. For
fine-grained reconstruction, our key insight is to incorporate geometric priors
into both the neural implicit scene representation and neural volume rendering,
thus leading to an effective geometry learning mechanism based on volume
rendering optimization. Building on this, we present MonoNeuralFusion, which
performs online neural 3D reconstruction from monocular videos, efficiently
generating and refining the 3D scene geometry during on-the-fly monocular
scanning. Extensive comparisons with state-of-the-art approaches show that
MonoNeuralFusion consistently produces more complete and fine-grained
reconstructions, both quantitatively and qualitatively.
Comment: 12 pages, 12 figures
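A minimal sketch of the general idea, geometric priors supervising quantities that are volume-rendered from an implicit representation, might look as follows. The SDF-to-density conversion, the prior sources (e.g., a monocular predictor), and the network signature are all assumptions of this sketch; the paper's actual formulation will differ.

```python
# Minimal sketch: volume-render depth/normals from an implicit SDF network and
# supervise them with externally predicted geometric priors.
import torch

def render_ray(implicit_net, origin, direction, n_samples=64, near=0.1, far=5.0):
    """Volume-render color, depth, and normal along one ray."""
    ts = torch.linspace(near, far, n_samples)
    pts = (origin + ts[:, None] * direction).detach().requires_grad_(True)
    sdf, color = implicit_net(pts)                      # assumed signature: (N,3) -> (N,), (N,3)
    normals = torch.autograd.grad(sdf.sum(), pts, create_graph=True)[0]
    density = torch.sigmoid(-10.0 * sdf)                # crude SDF-to-density conversion
    alpha = 1.0 - torch.exp(-density * (far - near) / n_samples)
    T = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    w = T * alpha                                       # per-sample rendering weights
    return (w[:, None] * color).sum(0), (w * ts).sum(), (w[:, None] * normals).sum(0)

def geometric_prior_loss(depth, normal, prior_depth, prior_normal):
    """Supervise rendered depth/normals with predicted priors (an assumption of this sketch)."""
    cos = torch.nn.functional.cosine_similarity(normal[None], prior_normal[None], dim=1)
    return (depth - prior_depth).abs() + (1.0 - cos).mean()
```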
Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer
Most existing point cloud completion methods suffer from the discrete nature
of point clouds and the unstructured prediction of points in local regions,
which makes it difficult to reveal fine local geometric details. To resolve
this issue, we propose SnowflakeNet with snowflake point deconvolution (SPD) to
generate complete point clouds. SPD models the generation of point clouds as
the snowflake-like growth of points, where child points are generated
progressively by splitting their parent points after each SPD. Our insight into
the detailed geometry is to introduce a skip-transformer in the SPD to learn
the point splitting patterns that can best fit the local regions. The
skip-transformer leverages an attention mechanism to summarize the splitting
patterns used in the previous SPD layer and produce the splitting in the
current layer. The locally compact and structured point clouds generated by SPD
precisely reveal the structural characteristics of the 3D shape in local
patches, which enables us to predict highly detailed geometries. Moreover,
since SPD is a general operation that is not limited to completion, we explore
its applications in other generative tasks, including point cloud
auto-encoding, generation, single-image reconstruction, and upsampling.
Experimental results show that our method outperforms state-of-the-art methods
on widely used benchmarks.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 2022. This work is a journal extension of our ICCV 2021 paper
arXiv:2108.04444. The first two authors contributed equally.
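The point-splitting idea at the heart of SPD can be sketched as follows: each parent point is duplicated k times and offset by predicted displacements, so children grow around their parents like snowflake branches. The real SPD predicts these offsets with a skip-transformer conditioned on the previous layer's splitting; the plain MLP below is a stand-in, and the feature dimension and up-sampling factor are assumptions.

```python
# Minimal sketch: one point-splitting (deconvolution) step.
import torch
import torch.nn as nn

class SplitStep(nn.Module):
    def __init__(self, feat_dim=128, up_factor=2):
        super().__init__()
        self.up = up_factor
        self.offset_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3 * up_factor))           # k displacement vectors per parent

    def forward(self, points, feats):
        """points: (B, N, 3); feats: (B, N, feat_dim) -> (B, N*k, 3) child points."""
        b, n, _ = points.shape
        offsets = self.offset_mlp(torch.cat([points, feats], dim=-1))
        offsets = offsets.view(b, n, self.up, 3)
        children = points.unsqueeze(2) + offsets      # split each parent into k children
        return children.reshape(b, n * self.up, 3)
```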
HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Recent advances in text-to-image diffusion models have enabled 3D generation
from a single image. However, current image-to-3D methods often produce
suboptimal results for novel views, with blurred textures and deviations from
the reference image, limiting their practical applications. In this paper, we
introduce HiFi-123, a method designed for high-fidelity and multi-view
consistent 3D generation. Our contributions are twofold: First, we propose a
reference-guided novel view enhancement technique that substantially reduces
the quality gap between synthesized and reference views. Second, capitalizing
on the novel view enhancement, we present a novel reference-guided state
distillation loss. When incorporated into the optimization-based image-to-3D
pipeline, our method significantly improves 3D generation quality, achieving
state-of-the-art performance. Comprehensive evaluations demonstrate the
effectiveness of our approach over existing methods, both qualitatively and
quantitatively.